 modeling tabular data


Modeling Tabular data using Conditional GAN

Neural Information Processing Systems

Modeling the probability distribution of rows in tabular data and generating realistic synthetic data is a non-trivial task. Tabular data usually contains a mix of discrete and continuous columns. Continuous columns may have multiple modes, whereas discrete columns are sometimes imbalanced, making the modeling difficult. Existing statistical and deep neural network models fail to properly model this type of data. We design CTGAN, which uses a conditional generative adversarial network to address these challenges. To aid in a fair and thorough comparison, we design a benchmark with 7 simulated and 8 real datasets and several Bayesian network baselines. CTGAN outperforms Bayesian methods on most of the real datasets, whereas other deep learning methods could not.
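The abstract does not say how conditioning copes with imbalanced discrete columns. As a rough illustration only, the sketch below assumes the condition is drawn with probability proportional to the log of each category's count, so rare categories are seen far more often than their raw frequency would allow. The function names and the toy column are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter


def log_frequency_weights(values):
    """Map each category to a probability proportional to log(1 + count)."""
    counts = Counter(values)
    logs = {cat: math.log(1 + n) for cat, n in counts.items()}
    total = sum(logs.values())
    return {cat: w / total for cat, w in logs.items()}


def sample_condition(values, rng):
    """Draw one category to condition on, using log-frequency weights (assumed scheme)."""
    weights = log_frequency_weights(values)
    cats = list(weights)
    return rng.choices(cats, weights=[weights[c] for c in cats], k=1)[0]


# Toy imbalanced column: 99% "no", 1% "yes" (made-up data).
column = ["no"] * 990 + ["yes"] * 10
probs = log_frequency_weights(column)

rng = random.Random(0)
draws = Counter(sample_condition(column, rng) for _ in range(2000))
print(probs)
print(draws)
```

With this toy column the minority category is chosen as the condition in roughly a quarter of the draws rather than 1% of them, which is the kind of rebalancing a conditional generator can exploit.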


Reviews: Modeling Tabular data using Conditional GAN

Neural Information Processing Systems

Originality: The main originality of the paper is a data transformation process applied to tabular data so that a GAN can learn from it. This is definitely highly novel and could be useful in similar situations involving such distributions. Apart from this, however, I feel that the authors are overclaiming a bit regarding several challenges/contributions:
- C2 (L86): The choice of activation function certainly depends on the data format; listing that as a "challenge" seems a bit too much to me, unless the authors can point out non-trivial adaptations they made to address the problem (and I apologize if I missed that...).
- C4 (L98): Again, hardly something new.
- C5 (L105): Mode collapse is certainly well studied in the literature (speaking of which, the authors should add references on newer approaches such as BourGAN); using an off-the-shelf solution (PacGAN), again, does not seem to me to be an important contribution.
Rephrasing the section to focus on the important contributions (C3, and perhaps C1) would make the contributions of the paper clearer, in my opinion.

Quality: The paper is of high quality and the description of techniques is sound.


Comparative Analysis of Transformers for Modeling Tabular Data: A Casestudy using Industry Scale Dataset

Singh, Usneek, Arora, Piyush, Ganesan, Shamika, Kumar, Mohit, Kulkarni, Siddhant, Joshi, Salil R.

arXiv.org Artificial Intelligence

We perform a comparative analysis of transformer-based models designed for modeling tabular data, specifically on an industry-scale dataset. While earlier studies demonstrated promising outcomes on smaller public or synthetic datasets, the effectiveness did not extend to larger industry-scale datasets. The challenges identified include handling high-dimensional data, the necessity for efficient pre-processing of categorical and numerical features, and addressing substantial computational requirements. To overcome the identified challenges, the study conducts an extensive examination of various transformer-based models using both synthetic datasets and the default prediction Kaggle dataset (2022) from American Express. The paper presents crucial insights into optimal data pre-processing, compares pre-training and direct supervised learning methods, discusses strategies for managing categorical and numerical features, and highlights trade-offs between computational resources and performance. Focusing on temporal financial data modeling, the research aims to facilitate the systematic development and deployment of transformer-based models in real-world scenarios, emphasizing scalability.


PTab: Using the Pre-trained Language Model for Modeling Tabular Data

Liu, Guang, Yang, Jie, Wu, Ledell

arXiv.org Artificial Intelligence

Tabular data is the foundation of the information age and has been extensively studied. Recent studies show that neural-based models are effective in learning contextual representations for tabular data. Learning an effective contextual representation requires meaningful features and a large amount of data. However, current methods often fail to properly learn a contextual representation from features that lack semantic information. In addition, it is intractable to enlarge the training set by mixing tabular datasets because of the differences between datasets. To address these problems, we propose a novel framework, PTab, which uses a Pre-trained language model to model Tabular data. PTab learns a contextual representation of tabular data through three-stage processing: Modality Transformation (MT), Masked-Language Fine-tuning (MF), and Classification Fine-tuning (CF). We initialize our model with a pre-trained model (PTM) that contains semantic information learned from large-scale language data. Consequently, a contextual representation can be learned effectively during the fine-tuning stages. In addition, we can naturally mix the textualized tabular data to enlarge the training set and further improve representation learning. We evaluate PTab on eight popular tabular classification datasets. Experimental results show that our method achieves a better average AUC score in supervised settings than state-of-the-art baselines (e.g., XGBoost), and outperforms counterpart methods in semi-supervised settings. We present visualization results that show PTab has good instance-based interpretability.
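To make the Modality Transformation idea concrete, here is a minimal sketch of turning rows into text so that tables with different schemas can share one training corpus. The "column is value" template, the function names, and the two toy tables are assumptions for illustration; the abstract does not specify the exact textualization format.

```python
def textualize_row(row):
    """Turn a mapping of column name -> value into one text sequence.

    The "<column> is <value>" phrasing is an assumed template.
    """
    parts = [f"{name} is {value}" for name, value in row.items()]
    return ", ".join(parts)


def textualize_table(rows):
    """Textualize every row of a table given as a list of dicts."""
    return [textualize_row(row) for row in rows]


# Two toy tables with different columns (made-up data).
census_like = [{"age": 39, "education": "Bachelors"}]
loan_like = [{"income": 52000, "default": "no"}]

# Once rows are plain text, the schemas no longer need to match,
# so the two tables can be concatenated into a single corpus.
corpus = textualize_table(census_like) + textualize_table(loan_like)
for line in corpus:
    print(line)
```

Because every row becomes an ordinary sentence, a pre-trained language model's tokenizer can consume the mixed corpus directly, which is what makes combining datasets straightforward in this setup.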


Modeling Tabular data using Conditional GAN

Xu, Lei, Skoularidou, Maria, Cuesta-Infante, Alfredo, Veeramachaneni, Kalyan

Neural Information Processing Systems

Modeling the probability distribution of rows in tabular data and generating realistic synthetic data is a non-trivial task. Tabular data usually contains a mix of discrete and continuous columns. Continuous columns may have multiple modes, whereas discrete columns are sometimes imbalanced, making the modeling difficult. Existing statistical and deep neural network models fail to properly model this type of data. We design CTGAN, which uses a conditional generative adversarial network to address these challenges.